1. INTRODUCTION

Communication is a basic need of humans as social beings. It allows people to observe their
environment so that they can survive and adapt to it [1]. The communication process requires a medium
that carries information. Today, mass media has become a vital element in shaping and changing public
opinion. In the past, media meant newspapers, magazines, radio, films, and television. With current
technological advances, however, mass media is increasingly associated with interactive, internet-based
media. Today's communication technology sustains the shift from conventional media to modern
media [2].

In recent years, opinion and sentiment mining has been automated using online messages such as
Twitter threads, news, and product reviews. This research uses Indonesian-language tweets that discuss
immigration services in Indonesia. Sentiment analysis is a method that evaluates text data to identify the
feelings and opinions it expresses, both positive and negative [3]. Twitter users can provide objective
opinions on various topics or issues [4]. One of the earliest studies on Twitter sentiment analysis was
conducted by [5], who treated the problem as a two-class classification and characterized tweets as
positive or negative. Researchers have conducted sentiment analysis on reviews using the Naive Bayes
(NB), support vector machine (SVM), and logistic regression (LR) models. SVM has been widely used for
classification and regression and has performed well, in theory and in practice, across various
domains [6].
One study demonstrated the effectiveness of SVM for processing Arabic tweet data [7], with
satisfactory results. Alves et al. [8] used a dataset for sentiment analysis with NB and SVM and also
presented a method that classifies the polarity of tweet sentiment by considering spatial and temporal
information. Kharde and Sonawane [9] showed the effectiveness of SVM and NB in sentiment analysis
using datasets collected from Twitter. Troussas et al. [10] used NB to classify Facebook statuses and
compared the results with the Rocchio and perceptron classifiers; NB achieved the better precision, 77%,
of the three. The NB method can also be combined with genetic-algorithm feature selection, as
Muthia [11] did using hotel review data: the original NB method obtained an accuracy of 78.5%, which
rose to 83% after feature selection with the genetic algorithm. Martiti and Juliane [12] found that the NB
classifier in their sentiment analysis application reached an accuracy of 86.6%.

The Directorate General of Immigration (Ditjenim), a government agency that provides public
services in the field of immigration, also uses Twitter as a communication instrument. In addition, the
technical service units (UPT) spread across Indonesia have Twitter accounts in order to realize good
governance. As a public-relations medium, Twitter must be able to support two-way communication
between government administrators and the community. Given the number of internet and Twitter users
in Indonesia, there is great potential for Ditjenim and its UPT to make use of big data. Big data is a system
that unites the real world, humans, and the virtual world (social media) [13]. When records of user
conversations on Twitter are processed, they reveal patterns or characteristics of information. These can
be used in formulating strategies, in research, and in gauging market (community) responses to an
immigration service or product.

Ditjenim has great potential for processing and using big data because 126 UPT spread throughout
Indonesia provide immigration services such as issuing passports, visas, and residence permits, and
handling the resulting public complaints requires professional data management. Big data contained in
social media is unstructured [14] and has no pattern or schema [15]. Text mining is used to process such
unstructured, unpatterned text data [16].

Complaints in the form of tweets, when compiled and analyzed, can reveal patterns in trends or
issues related to immigration services. The data can be used for sentiment analysis of various
immigration policies, providing feedback that institutions can use to improve. In addition to SVM, this
study uses the probabilistic NB classifier. This method is simple but powerful because it achieves high
accuracy and performance in classification [17]. The NB method can be used to polarize sentiment into
positive and negative categories [10]. The LR model, a supervised learning algorithm that can classify
text data, is also applied so that the performance of the three algorithms can be compared. The
comparison uses the confusion matrix and the area under the curve (AUC) of the receiver operating
characteristic (ROC) curve of each algorithm. Classification quality is judged from the AUC value, which
is divided into several groups [18]: 0.90-1.00 for very good classification, 0.80-0.90 for good
classification, 0.70-0.80 for adequate classification, 0.60-0.70 for poor classification, and 0.50-0.60 for
wrong classification.
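
As an illustration, the AUC groups quoted from [18] can be written as a small lookup helper. This is only a sketch: the function name auc_quality is hypothetical, and because the quoted ranges share their endpoints, assigning each boundary value to the higher group is an assumption rather than something stated in the source.

```python
def auc_quality(auc):
    """Map an AUC value to the qualitative groups quoted from [18]."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie in [0, 1]")
    # Assumption: a shared endpoint such as 0.80 belongs to the higher group.
    if auc >= 0.90:
        return "very good"
    if auc >= 0.80:
        return "good"
    if auc >= 0.70:
        return "adequate"
    if auc >= 0.60:
        return "poor"
    if auc >= 0.50:
        return "wrong"
    return "below 0.50"  # not covered by the quoted groups


for value in (0.69, 0.76, 0.77):
    print(value, auc_quality(value))
```

Values below 0.50 are reported separately because the quoted groups do not cover them.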


2. METHOD 

Singh and Dubey [19] conducted a literature study on sentiment analysis and opinion research
on social issues, with data extracted from the websites of selected newspapers. They argued that
combining different types of classification techniques can produce better results. Akbani et al. [20]
classified the sentiment of tweets in Arabic using NB, decision trees, and SVM. In that study, the
framework for classifying Arabic tweets consists of several subtasks such as term frequency-inverse
document frequency (TF-IDF) and Arabic mood.

In addition, three information-retrieval metrics were used for performance evaluation: precision,
recall, and F-score. Shoukry and Rafea [21] focused on the effect of preprocessing on the sentiment
classification process. Ahmad and Aftab [22] analyzed the performance of SVM for polarity detection
from textual data. Davidov et al. [23] used an SVM classifier prepared with eleven features for transient
stability assessment (TSA). Because public opinion (sentiment) on a subject can be surveyed, data can be
collected from social media such as Twitter and analyzed in real time [24]. Social media sentiment
analysis has been widely used in the topics studied [25]-[28]. Understanding how a community or
customers perceive a product is very useful for business marketing strategies [29], [30].

The workflow of this paper is shown in Figure 1. After collecting data from Twitter related to
Indonesian immigration and its passport services, we carry out data understanding. In this step, we try
to understand the data we have and identify potential problems in the dataset. The next stage is data
preparation, in which we perform cleaning, case-folding, tokenizing, filtering, and stemming to obtain
what we call preprocessed data before sentiment polarization is carried out. After determining the
sentiment polarity, we build models using the three algorithms mentioned earlier, NB, SVM, and LR, and
evaluate the results with a confusion matrix and ROC curve to find out which model performs best.


[Workflow diagram: Data Understanding > Data Preparation (Text Cleaning, Case-Folding, Tokenizing,
Filtering, Stemming) > Text Pre-processed > Sentiment Polarization > Modeling > Evaluation]


Figure 1. Research workflow 


2.1. Data understanding 

Data understanding is conducted to understand the dataset. We first import it into our Jupyter
notebook workspace and use the pandas library to inspect its details, as shown in Figure 2. This is the
original data we obtained, which we call raw data.


Text

#Repost @ditjen_imigrasi\n_..\nSahabat Mido, p...
Reposted from @ditjen_imigrasi Sahabat Mido, p...
@ditjen_imigrasi Saya sudah mengisi data pada ...
@sashaasays @ditjen_imigrasi Bukannya emg yg b...
@tempodotco "Resmi" berarti masuk melalui imig...
Imigrasi Ngurah Rai Bali Tangkap WNA Nigeria y...
@Kemenkumham_RI @Kumham_Sulsel Kemenkumham Sul...
Imigrasi Blitar ungkap peningkatan permohonan ...
@Kemenkumham_RI @Kumham_Sulsel Kemenkumham Sul...
RT @elrinyuliana: @saidiman sama pilih beberap...


Figure 2. Raw data 


Data understanding is the first step in data analysis. The data is checked to reveal what problems
it contains, and a summary and identification of potential problems can also be made. This stage must be
done carefully because it determines the results of the next stage. The summary serves as a reference to
ensure that the data distribution is appropriate and to find deviations that must be handled in data
preparation. Problems such as null values, outliers, and poor data density can be fixed in data
preparation [31]. After understanding the raw data, we check for duplicated tweets, which can occur in a
large dataset. The result is shown in Figure 3.

According to Figure 3, there are 4,809 duplicated tweets, which we find using the duplicated
function from the pandas library. We delete all of them in the data preparation stage. At this data
understanding stage, we also check how much data we have and whether any values are empty (null).
Null data is noise that can cause problems in the analysis, so for the best results the data should be
clean. Our dataset has 10,000 tweet rows and no blank rows or null values, so we can use it for analysis.
Had we found null data, we would have deleted it and checked again that none remained.


The performance of Naive Bayes, support vector machine, and logistic regression on ... (Priati Assiroj)


3846 ISSN: 2302-9285


Text 
991 RT @SeruniPuspaAlam: Negara asing akan pikir-p... 
994 RT @SeruniPuspaAlam: Negara asing akan pikir-p... 


999 sore ini ya. 
1000 @kompascom @tempodotco @repu... 
1002 sore ini ya. 


9978 1. Kakanwil Kemenkumham Sumsel Harun Sulianto ... 
9980 Masih inget kejadiannya ya @Dennysiregar7?\nYa... 
9981 Bokep Indo Doyan Sperma\n\nBokep Indo Viral Sk... 
9992 1. Kakanwil Kemenkumham Sumsel Harun Sulianto ... 


9998 Bokep Indo Doyan Sperma\n\nBokep Indo Viral Sk... 


4809 rows x 1 columns 


Figure 3. Duplicate tweets 
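
The checks described above can be sketched with pandas on a toy table. The five example tweets are invented for illustration, and using the default behavior of duplicated (every repeat after the first occurrence is flagged) is an assumption; it is at least consistent with 10,000 rows minus 4,809 flagged rows leaving 5,191.

```python
import pandas as pd

# Toy stand-in for the raw tweets (hypothetical rows, not the real dataset).
tweets = pd.DataFrame({"Text": [
    "paspor saya belum jadi",
    "RT antrian imigrasi panjang",
    "paspor saya belum jadi",       # repeat of row 0
    "pelayanan imigrasi cepat",
    "RT antrian imigrasi panjang",  # repeat of row 1
]})

n_rows = len(tweets)
n_null = int(tweets["Text"].isnull().sum())

# duplicated() marks each repeat after the first occurrence as True.
dup_mask = tweets.duplicated(subset="Text")
n_duplicates = int(dup_mask.sum())

# Keep one copy of each tweet and renumber the rows.
deduped = tweets.drop_duplicates(subset="Text").reset_index(drop=True)

print(n_rows, n_null, n_duplicates, len(deduped))
```

On the toy table this reports 5 rows, 0 null values, 2 duplicates, and 3 rows remaining after deduplication.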


2.2. Data preparation 

This stage fixes the problems identified in the previous stage. It also determines how well the
data suits the algorithm to be used; ideally, it is revisited whenever problems occur in modeling until a
suitable form of the data is found. Activities include data selection, transformation, and data cleansing
so that the data is finally ready for modeling [32]. One of the actions we take is to delete duplicated
data. The 4,809 duplicated rows shown in Figure 3 are deleted, leaving 5,191 rows to be used for this
research. We now have data with no null values and no duplicates.


2.2.1. Text cleansing 

At this stage, we perform data cleansing after the duplicate data is deleted. We use the re module
from the Python standard library. We also use the natural language toolkit (NLTK) library to tokenize and
remove stopwords, and the Sastrawi library for text processing in Indonesian. The text cleaning function
removes unnecessary characters such as excessive spaces, symbols, numbers, links, hashtags, and
mentions. We then create the functions needed for the subsequent steps, namely case-folding,
tokenizing, filtering, and stemming, which are required for text analysis.
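
A cleaning function of the kind described could look like the sketch below, which uses only the re module. The function name clean_text and the exact regular expressions are assumptions; the source does not show the authors' implementation, so this is not expected to reproduce Figure 4 character for character.

```python
import re


def clean_text(text):
    """Remove links, mentions, hashtags, numbers, symbols, and extra spaces."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # numbers and symbols
    return re.sub(r"\s+", " ", text).strip()            # excessive whitespace


sample = "@ditjen_imigrasi Saya sudah   mengisi data 123 http://t.co/abc #paspor!"
print(clean_text(sample))
```

Links are removed first so that the later symbol filter does not leave fragments of a URL behind as stray words.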


2.2.2. Case-folding 

Case-folding converts text to a standard lowercase form, removing any distinction between
uppercase and lowercase letters. This process uses the lower method of Python strings, and the result is
lowercased text data. Once we have the case-folded data, we carry out tokenization.


2.2.3. Tokenizing 

Tokenizing is the process of splitting text into words. The text in the text column is broken up
word by word into a list of strings. This process uses the word_tokenize function in the NLTK library.
The result is text data consisting of the individual words of each sentence.
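
Case-folding and tokenizing can be illustrated together in a few lines. To keep the sketch free of external downloads, a whitespace split stands in for NLTK's word_tokenize; that substitution and the helper names case_fold and tokenize are assumptions made for illustration.

```python
def case_fold(text):
    """Lowercase the text so that 'Imigrasi' and 'imigrasi' compare equal."""
    return text.lower()


def tokenize(text):
    """Split on whitespace (a simplified stand-in for NLTK's word_tokenize)."""
    return text.split()


tokens = tokenize(case_fold("Resmi berarti masuk melalui Imigrasi mana nih"))
print(tokens)
```

On already cleaned text, where punctuation has been removed, a whitespace split and a full tokenizer usually give the same words, which is why the simpler form suffices here.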


2.2.4. Filtering 

In this process, we deliberately keep only Indonesian tweets for the research. The filtering
process uses the stopwords corpus provided by the NLTK library. The filtered data then passes to the
next process, which is word stemming.
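
Stopword filtering on a token list can be sketched as follows. The small stopword set here is hypothetical and chosen only so that the example mirrors row 4 of the preprocessed data; the study itself relies on the stopword list shipped with NLTK.

```python
# Hypothetical mini stopword list for illustration only.
STOPWORDS_ID = {"berarti", "melalui", "mana", "yang", "dan", "di", "ke"}


def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in STOPWORDS_ID]


tokens = ["resmi", "berarti", "masuk", "melalui", "imigrasi", "mana", "nih"]
filtered = remove_stopwords(tokens)
print(filtered)
```

A set is used for the stopwords because membership tests on a set take constant time, which matters when the filter runs over thousands of tweets.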


2.2.5. Stemming

The final step in text processing is to convert all the words in the text to their base form.
For example, the word "reading" is changed to "read". This process can take a long time, depending on
the size of the dataset. After all the text processing functions are created, we apply them one by one to
obtain the data shown in Figure 4. The cleaned data is saved as 'cleaned_data.csv', and this dataset is
used in the next process. As seen in the 'text_preprocessed' column of Figure 4, some special characters
such as quotes remain and should be removed. Figure 5 shows the data after these remaining special
characters are removed.


      text_clean                                          text_preprocessed
0     imigrasisahabat mido pengen ubah nama panggila...   ['imigrasisahabat', 'mido', 'ken', 'ubah', 'na...
1     reposted from imigrasi sahabat mido pengen uba...   ['reposted', 'from', 'imigrasi', 'sahabat', 'm...
2     imigrasi saya sudah mengisi data pada app m pa...   ['imigrasi', 'isi', 'data', 'app', 'm', 'paspo...
3     imigrasi bukannya emg yg buat bali udh bisa sa...   ['imigrasi', 'emg', 'yg', 'bal', 'udh', 'sa', ...
4     resmi berarti masuk melalui imigrasi mana nih       ['resmi', 'masuk', 'imigrasi', 'nih']
5186  optimalkan nilai ikpa kantor imigrasi kelas i ...   ['optimal', 'nilai', 'ikpa', 'kantor', 'imigra...
5187  halo sahabat mido untuk menjawab kebutuhan pas...   ['halo', 'sahabat', 'mido', 'butuh', 'paspor', ...
5188  kayanya hari ini akan lembur sampe sahur lagi ...   ['kaya', 'lembur', 'sampe', 'sahur', 'gapapa',...
5189  jakarta—kantor imigrasi kelas i khusus non tpi...   ['jakarta', 'kantor', 'imigrasi', 'kelas', 'i', '...
5190  halo sahabat midoterimakasih banyak atas apres...   ['halo', 'sahabat', 'midoterimakasih', 'apresi...

5191 rows x 2 columns


Figure 4. Results of applying the preprocessing functions


      text_clean                                          text_preprocessed
0     imigrasisahabat mido pengen ubah nama panggila...   [imigrasisahabat, mido, ken, ubah, nama, pangg...
1     reposted from imigrasi sahabat mido pengen uba...   [reposted, from, imigrasi, sahabat, mido, ken,...
2     imigrasi saya sudah mengisi data pada app m pa...   [imigrasi, isi, data, app, m, paspor, mentok, ...
3     imigrasi bukannya emg yg buat bali udh bisa sa...   [imigrasi, emg, yg, bal, udh, sa, kalo, cgk, e...
4     resmi berarti masuk melalui imigrasi mana nih       [resmi, masuk, imigrasi, nih]
5186  optimalkan nilai ikpa kantor imigrasi kelas i ...   [optimal, nilai, ikpa, kantor, imigrasi, kelas...
5187  halo sahabat mido untuk menjawab kebutuhan pas...   [halo, sahabat, mido, butuh, paspor, sahabat, ...
5188  kayanya hari ini akan lembur sampe sahur lagi ...   [kaya, lembur, sampe, sahur, gapapa, ri, imigr...
5189  jakarta—kantor imigrasi kelas i khusus non tpi...   [jakarta, kantor, imigrasi, kelas, i, khusus, ...
5190  halo sahabat midoterimakasih banyak atas apres...   [halo, sahabat, midoterimakasih, apresiasi, pe...

5191 rows x 2 columns


Figure 5. Pre-processed text 


2.3. Text pre-processed 

Text preprocessing refers to a set of techniques and steps applied to raw textual data before it is used 
for further analysis or natural language processing tasks. It involves transforming the text into a format that is 
more suitable and efficient for subsequent processing. Text preprocessing typically includes tasks such as 
removing punctuation, converting to lowercase, tokenization (splitting text into individual words or tokens), 
removing stop words (commonly used words that do not carry significant meaning), stemming or 
lemmatization (reducing words to their base or root form), and handling special characters or encoding 
issues. The goal of text preprocessing is to clean and standardize the text data, making it easier to analyze and 
derive meaningful insights. Figure 5 shows that the data in the text_preprocessed column is ready for
sentiment analysis. The analysis assigns sentiment polarity based on an Indonesian-language lexicon
dictionary. Such lexicons are freely available on the internet and can be downloaded by anyone who
needs them. We use two lexicon dictionaries, one positive and one negative.


2.4. Sentiment polarization

Sentiment polarity is an expression that defines the sentimental aspect of an opinion. In text data,
the result of sentiment analysis can be determined for each document, sentence, or entity. Sentiment
polarity can be defined as positive, negative, or neutral [33]. The fundamental task of sentiment analysis
is to classify whether the opinion expressed in a document, sentence, or entity attribute/aspect is
positive, negative, or neutral [34]. The polarity of sentiment for an item determines the orientation of
the expressed sentiment, that is, whether the text expresses the user's positive, negative, or neutral
feelings toward the entity in question [35].

Figure 6 shows the Python function that scores the data against the Indonesian lexicon
dictionaries. The function sums the lexicon values of the words in a tweet and labels the tweet positive
if the score is greater than 0, negative if it is less than 0, and neutral if it equals 0. The result is shown
in Figure 7: 3,655 tweets are negative, 973 are positive, and 563 are neutral. We also display a sample of
the polarized data covering these three types of sentiment. Figure 7 thus illustrates the polarization
obtained from the data using the existing lexicon dictionaries.


def sentiment_analysis_lexicon_indonesia(text):
    score = 0
    for word in text:
        if (word in lexicon_positive):
            score = score + lexicon_positive[word]
    for word in text:
        if (word in lexicon_negative):
            score = score + lexicon_negative[word]
    polarity = ''
    if (score > 0):
        polarity = 'positive'
    elif (score < 0):
        polarity = 'negative'
    else:
        polarity = 'neutral'
    return score, polarity


Figure 6. Sentiment polarity


# sentiment polarization result
results = tweets['text_preprocessed'].apply(sentiment_analysis_lexicon_indonesia)
results = list(zip(*results))
tweets['polarity_score'] = results[0]
tweets['polarity'] = results[1]
print(tweets['polarity'].value_counts())

tweets

negative    3655
positive     973
neutral      563
Name: polarity, dtype: int64


Figure 7. Polarization result


With the polarity of the data known, we visualize it for a closer look, as shown in Figure 8.
Figure 8 shows that 70.4% of the tweets express negative sentiment about Indonesian immigration, 18.7%
express positive sentiment, and 10.8% are neutral. It can be seen from Figure 8 that the sentiment data is
not yet balanced for modeling: the negative class exceeds the positive class by 2,682 tweets (3,655
versus 973), and this gap must be handled so that the dataset is balanced. We balance the dataset with
the NumPy library before modeling, keeping only the positive and negative polarities. We then visualize
the balanced data to see the percentage of each class in detail. The result, shown in Figure 9, confirms
that the sentiment data is balanced: 50% positive and 50% negative. The data is therefore ready for the
next step, which is modeling.


Figure 8. Polarity
Figure 9. Balanced visualization
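
One common way to obtain a 50/50 split is random undersampling of the larger class, sketched below. The source does not say whether undersampling or oversampling was used, so this choice is an assumption, as are the function name undersample and the use of the standard random module in place of NumPy.

```python
import random


def undersample(rows, label_key="polarity", seed=0):
    """Drop neutral rows, then sample both classes down to the smaller count."""
    positive = [row for row in rows if row[label_key] == "positive"]
    negative = [row for row in rows if row[label_key] == "negative"]
    n = min(len(positive), len(negative))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    balanced = rng.sample(positive, n) + rng.sample(negative, n)
    rng.shuffle(balanced)
    return balanced


# Toy rows with roughly the same skew as the reported class counts.
rows = (
    [{"polarity": "negative"}] * 37
    + [{"polarity": "positive"}] * 10
    + [{"polarity": "neutral"}] * 6
)
balanced = undersample(rows)
labels = [row["polarity"] for row in balanced]
print(labels.count("positive"), labels.count("negative"))
```

Undersampling discards data from the majority class, so it trades a smaller training set for equal class weights.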


3. RESULTS AND DISCUSSION 
3.1. Modeling 

As previously explained, this study uses the NB, SVM, and LR algorithms for sentiment analysis,
with 70% of the data used for training and 30% for testing. The results of each model are shown in
Figure 10 for NB, Figure 11 for SVM, and Figure 12 for LR. Figure 10 shows that the accuracy of the NB
model is 70%, while the precision values are 85% and 64% for classes 0 and 1, respectively. Figure 11
shows that the accuracy of the SVM model is 76%, higher than that of the NB model, while the precision
values are 82% and 72%, respectively. Figure 12 shows that the accuracy of the LR model is 77%, higher
than both the NB and SVM models, while the precision values are 81% and 74%, respectively.


# Bayesian Model
from sklearn.naive_bayes import BernoulliNB

BNBmodel = BernoulliNB()
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)  # evaluation helper defined by the authors (not shown)
y_pred1 = BNBmodel.predict(X_test)

              precision    recall  f1-score   support
           0       0.85      0.47      0.60        62
           1       0.64      0.92      0.76        64
    accuracy                           0.70       126
   macro avg       0.75      0.69      0.68       126
weighted avg       0.75      0.70      0.68       126


Figure 10. Result of NB


# SVM Model
from sklearn.svm import LinearSVC

SVCmodel = LinearSVC()
SVCmodel.fit(X_train, y_train)
model_Evaluate(SVCmodel)
y_pred2 = SVCmodel.predict(X_test)

[Classification report for SVM: values not legible in the source]


Figure 11. Result of SVM


# Logistic Regression Model
from sklearn.linear_model import LogisticRegression

LRmodel = LogisticRegression(C = 2, max_iter = 10800, n_jobs=-1)
LRmodel.fit(X_train, y_train)
model_Evaluate(LRmodel)
y_pred3 = LRmodel.predict(X_test)

              precision    recall  f1-score   support
           0       0.81      0.69      0.75        62
           1       0.74      0.84      0.79        64

[accuracy, macro avg, and weighted avg rows: values not legible in the source]

Figure 12. Result of LR 
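
For readers who want to reproduce the overall modeling loop, the following self-contained sketch trains the three classifiers on a 70/30 split. The toy corpus is invented, and the use of TfidfVectorizer is an assumption, since the source does not state how X_train and X_test were built; max_iter is set to 1000 here simply because it is ample for the toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 1 = positive, 0 = negative.
positive = ["cepat", "ramah", "mudah", "bagus", "puas"]
negative = ["lambat", "ribet", "lama", "buruk", "kecewa"]
texts = [f"{subject} {word}"
         for subject in ("pelayanan paspor", "kantor imigrasi")
         for word in positive + negative]
labels = ([1] * 5 + [0] * 5) * 2

# Assumption: TF-IDF features; the paper does not name its vectorizer.
X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.30, random_state=42, stratify=labels)

models = {
    "NB": BernoulliNB(),
    "SVM": LinearSVC(),
    "LR": LogisticRegression(C=2, max_iter=1000),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

The stratify argument keeps the class ratio the same in the training and test parts, which matters after the data has been balanced.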


3.2. Evaluation 

To evaluate the models, we use the confusion matrix of each model along with its ROC curve.
Figure 13 is the confusion matrix for NB, Figure 14 for SVM, and Figure 15 for LR. The NB confusion
matrix in Figure 13 shows 46.83% true positives, 23.02% true negatives, 26.19% false positives, and
3.97% false negatives. The SVM confusion matrix in Figure 14 shows 43.65% true positives, 32.54% true
negatives, 16.67% false positives, and 7.14% false negatives. The LR confusion matrix in Figure 15 shows
42.86% true positives, 34.13% true negatives, 15.08% false positives, and 7.94% false negatives.


[Confusion matrix heatmaps for the three models; x-axis: predicted values (Negative, Positive)]

Figure 13. NB
Figure 14. SVM
Figure 15. LR
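
The relation between a confusion matrix and the scores in the classification report can be checked with a short calculation. The counts below are not printed in the source; they are recovered by multiplying the reported NB percentages by the 126 test tweets (62 + 64 in the support column) and rounding, so they should be read as derived values. The helper name binary_metrics is hypothetical.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# NB: 46.83%, 23.02%, 26.19%, and 3.97% of 126 test tweets, rounded.
nb = binary_metrics(tp=59, tn=29, fp=33, fn=5)
print({name: round(value, 2) for name, value in nb.items()})
```

The rounded results (accuracy 0.70, precision 0.64, recall 0.92, F1 0.76) agree with the class 1 row of the NB report in Figure 10.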


We also obtain the ROC curve of each model: Figure 16 for NB, Figure 17 for SVM, and Figure 18
for LR. The ROC curve shows the performance of a classification model at all classification thresholds. It
plots the true positive rate (sensitivity) on the y-axis against the false positive rate (1 - specificity) on
the x-axis [36], and so represents the trade-off between sensitivity and specificity. The AUC, the area
under this curve, provides an aggregate measure of performance across all possible classification
thresholds. Figure 16 shows that the AUC value of the NB model is 0.69. Figure 17 shows that the AUC
value of the SVM model is 0.76, which is higher than that of the NB model. Figure 18 shows that the AUC
value of the LR model is 0.77, higher than both the NB and SVM models.


[ROC curve plots; x-axis: False Positive Rate; legends: ROC curve (area = 0.69) for NB, ROC curve
(area = 0.76) for SVM, ROC curve (area = 0.77) for LR]

Figure 16. ROC of NB
Figure 17. ROC of SVM
Figure 18. ROC of LR
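
The area under a ROC curve can be computed with the trapezoidal rule. When a curve is drawn from hard class predictions rather than scores, it has a single interior point and the area reduces to the mean of the true positive rate and the true negative rate. The sketch below applies this to counts derived from the reported confusion-matrix percentages (multiplied by 126 test tweets and rounded). The results coincide with the reported AUC values, which suggests, but does not confirm, that the curves were produced this way; the function names are hypothetical.

```python
def auc_trapezoid(points):
    """Area under a ROC curve given (FPR, TPR) points, by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


def single_point_auc(tp, tn, fp, fn):
    """AUC of the three-point curve (0,0) -> (FPR, TPR) -> (1,1)."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return auc_trapezoid([(0.0, 0.0), (fpr, tpr), (1.0, 1.0)])


# (tp, tn, fp, fn) derived from the percentages in Figures 13-15.
counts = {"NB": (59, 29, 33, 5), "SVM": (55, 41, 21, 9), "LR": (54, 43, 19, 10)}
aucs = {name: round(single_point_auc(*c), 2) for name, c in counts.items()}
print(aucs)
```

With score-based predictions (for example, from predict_proba or decision_function in scikit-learn) the curve would have many points and the same trapezoidal function would still apply.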


4. CONCLUSION 

From testing the three models, NB, SVM, and LR, we draw the following conclusions. Accuracy:
LR outperforms SVM, which in turn outperforms Bernoulli NB; LR reaches 77%, SVM 76%, and NB 70%.
F1-score: i) for class 0, Bernoulli NB (F1=0.60) < SVM (F1=0.73) < LR (F1=0.75), and ii) for class 1,
Bernoulli NB (F1=0.76) < SVM and LR (F1=0.79). AUC score: the NB model has an AUC of 0.69, the SVM
model 0.76, and the LR model 0.77. We conclude that the best model for the given dataset is LR. This
result is in line with the Occam's razor principle, which holds that, when no particular assumptions can
be made about the data, the simplest adequate model tends to work best. Since we make no such
assumptions about our dataset and LR is a simple model, the principle applies here.